
PR to merge ARM SVE branch #315

Merged 13 commits on Feb 1, 2025

Conversation

@rdolbeau (Contributor) commented Mar 2, 2023

This is primarily for discussion and probably needs some cleaning up.

The current scheme using SVE (and RISC-V V) doesn't leverage the scalability; instead, it produces sets of codelets for all possible power-of-2 sizes, using masking. Only the sets whose size is <= the hardware implementation width are enabled. See here.

It also auto-generates the simd-support/vtw.h file that defines the various VTW macros. This is useful because, while SVE is limited to 2048-bit-wide vectors, RISC-V V is more or less unbounded, and there is at least one 16384-bit-wide implementation being developed.

So using SVE (or V) creates a fairly large, though versatile, library.
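
For readers unfamiliar with the masking idea, here is a minimal sketch (illustrative only, not the PR's actual codelet code) of how a fixed-width kernel stays correct on wider hardware:

```c
#include <arm_sve.h>

/* Sketch: a "256-bit" kernel body runs on any SVE implementation of
   width >= 256 bits. The predicate keeps only the first 8 x 32-bit
   lanes active, so wider hardware simply leaves the extra lanes
   masked off; they are never stored. */
void add8_masked(const float *a, const float *b, float *c)
{
    svbool_t pg = svwhilelt_b32(0, 8);   /* lanes 0..7 only */
    svfloat32_t va = svld1_f32(pg, a);
    svfloat32_t vb = svld1_f32(pg, b);
    svst1_f32(pg, c, svadd_f32_x(pg, va, vb));
}
```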

It has been tested in QEMU, using the Arm Instruction Emulator, and on real hardware (Fujitsu A64FX, AWS Graviton 3).

@rdolbeau (Contributor Author) commented Mar 3, 2023

I've created a release package for testing this w/o the 'maintainer mode' requirements. See here: https://github.com/rdolbeau/fftw3/releases/tag/sve-test-release-001

@rdolbeau (Contributor Author)

@stevengj @matteo-frigo Can you comment on this? In particular the "use fixed-width implementations and enable all those equal to or narrower than the hardware width" approach. It's a bit hackish, but it's the best we have at this time. RISC-V is using the same principle for now.

@ggouaillardet (Contributor)

I would like to encourage the FFTW folks to consider this PR so it can be included (after some cleanup) into the next FFTW release.

At this stage, my main interest is only about ARM SVE.
SVE-enabled processors are currently shipped by several vendors:

  • Fujitsu A64FX (512 bits)
  • Amazon Graviton 3 (256 bits)
  • NVIDIA Grace (128 bits)

so there is naturally growing interest in upstream SVE support in FFTW.

Though SVE vector length can scale from 128 to 2048 bits (in 128-bit increments), only 3 sizes are available today (128, 256 and 512; the Fujitsu A64FX does not support 384 bits). So even if vector-length-agnostic FFTW might be achievable, I do not see this as mandatory, because of the currently limited available vector lengths and the fact that VLA code is (generally) slightly slower than vector-length-specific code. If needed, I can provide an option to set a maximum supported vector length in order to reduce the size of the FFTW library.

Bottom line, I approve the approach of the PR and hope it can make its way upstream.

@rdolbeau (Contributor Author)

Though SVE vector length can scale from 128 to 2048 bits (in 128-bit increments),
only 3 sizes are available today (128, 256 and 512; the Fujitsu A64FX does not support 384 bits)

Also, in its current incarnation SVE no longer allows non-power-of-2 multiples of 128 bits; only power-of-2 multiples of 128 bits are permitted. This is visible for instance in DDI0487J for the register ZCR_EL1:
"The Non-streaming SVE vector length can be any power of two from 128 bits to 2048 bits inclusive."

So this implementation should cover all cases, and I agree with you that 1024 and 2048 bits are currently only for emulators like QEMU and could (should?) be optional. 512 and 256 could also be optional for site-specific deployment.

@rdolbeau (Contributor Author) commented Mar 4, 2024

Done some minor clean-ups, minor improvements, and a rebase onto the current HEAD.

@ggouaillardet if you have the time, please confirm this is still OK for you.

@rdolbeau mentioned this pull request on Mar 4, 2024
@rdolbeau (Contributor Author) commented Mar 4, 2024

I've created a new package with pre-generated files, so maintainer mode is not required to test this on Arm+SVE, see https://github.com/rdolbeau/fftw3/releases/tag/sve-test-release-002

@ggouaillardet (Contributor)

Thanks @rdolbeau and my apologies for the late reply.

I noted configure --enable-sve works but make fails out of the box with Arm compilers, so I added a test in order to abort at configure instead of failing at make. Feel free to cherry-pick ggouaillardet/fftw3@c32583c

@rdolbeau (Contributor Author)

I noted configure --enable-sve works but make fails out of the box with Arm compilers

Probably with all compilers that do not enable SVE by default (all of them, probably!). Currently configure doesn't enable it explicitly, so it has to be done in {C/CXX/F}FLAGS.
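
For example, something along these lines (the exact -march string depends on the compiler and target):

```sh
./configure --enable-sve \
    CFLAGS="-O3 -march=armv8-a+sve" \
    CXXFLAGS="-O3 -march=armv8-a+sve" \
    FFLAGS="-O3 -march=armv8-a+sve"
```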

so I added a test in order to abort at configure instead of failing at make. Feel free to cherry-pick ggouaillardet/fftw3@c32583c

Good idea. I'll try that and add it to the PR ASAP.

@juntangc commented May 3, 2024

Though SVE vector length can scale from 128 to 2048 bits (in 128-bit increments),
only 3 sizes are available today (128, 256 and 512; the Fujitsu A64FX does not support 384 bits)

Also, in its current incarnation SVE no longer allows non-power-of-2 multiples of 128 bits; only power-of-2 multiples of 128 bits are permitted. This is visible for instance in DDI0487J for the register ZCR_EL1: "The Non-streaming SVE vector length can be any power of two from 128 bits to 2048 bits inclusive."

So this implementation should cover all cases, and I agree with you that 1024 and 2048 bits are currently only for emulators like QEMU and could (should?) be optional. 512 and 256 could also be optional for site-specific deployment.

I can see performance benefits in extending to 256 bits and above. Do you happen to have performance data for the SVE version compared to the SIMD NEON version?

@ggouaillardet (Contributor)

Here are some performance numbers I collected a while ago for fftw_plan_dft_1d(signal_length, original_signal, fft_applied_signal, FFTW_FORWARD, FFTW_ESTIMATE).
They show the performance improvement between NEON and SVE on the A64FX processor (512-bit SVE vectors):

[chart: NEON vs. SVE performance on A64FX]

Feel free to suggest other/better benchmarks if appropriate.
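
For anyone wanting to reproduce, a minimal harness around the plan call quoted above could look like this (illustrative only; the size, repeat count, and timing method are my choices, not the setup used for the chart):

```c
#include <fftw3.h>
#include <stdio.h>
#include <time.h>

int main(void)
{
    int n = 4096, reps = 1000;                /* illustrative values */
    fftw_complex *in  = fftw_malloc(sizeof(fftw_complex) * n);
    fftw_complex *out = fftw_malloc(sizeof(fftw_complex) * n);
    for (int i = 0; i < n; i++) in[i][0] = in[i][1] = 0.0;

    fftw_plan p = fftw_plan_dft_1d(n, in, out, FFTW_FORWARD, FFTW_ESTIMATE);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++) fftw_execute(p);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double dt = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    printf("%d transforms of size %d in %g s\n", reps, n, dt);

    fftw_destroy_plan(p);
    fftw_free(in);
    fftw_free(out);
    return 0;
}
```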

@antoine-morvan commented May 7, 2024

Hello,

I ran a few experiments using the included benchmark tests/bench, built with double support. My command bound the execution to the 4th core on each machine (an arbitrary choice):

numactl -C 3 ./tests/bench -v2 -r 20 -s $PB_DEF

where PB_DEF is one of the benchmark settings ocf2048, icf32x32, ocf128x128x128, icf256x256x256.
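
That is, roughly (reconstructed from the command above; not the exact script used):

```sh
for PB_DEF in ocf2048 icf32x32 ocf128x128x128 icf256x256x256; do
    numactl -C 3 ./tests/bench -v2 -r 20 -s $PB_DEF
done
```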

[chart: baseline (3.3.10, NEON) vs. this PR (SVE) mflops on the tested machines]

Baseline stands for the current release 3.3.10 built with --enable-neon, and the SVE gain is this PR built with --enable-neon --enable-sve (on the Neoverse N1, only --enable-neon was used). The charts compare the "mflops" value output by the benchmark utility.

We can observe significant gains on all the SVE machines (A64FX, Neoverse V1 and V2), as well as no degradation on the NEON-only machine (Neoverse N1).

Best regards.

@jlinford

These results are very exciting. FFTW performance is extremely important to users of NVIDIA Grace (Neoverse V2), which is now the dominant architecture on the Green500. A great many people would be glad to see this commit in the next FFTW release.

@keeranroth left a comment

Just a couple of nits. Feel free to address them or not. The structure looks good overall, and thanks for the benchmarking results as well. They speak for themselves.

@@ -0,0 +1,84 @@
#include <stdio.h>
#include <string.h>
#include <stdlib.h>


You gate the include later in this file, so remove this

Comment on lines +29 to +30
unsigned int size = rp2(osize);
if (osize != size)


Not really important, but checking for a power of two can be done with:

if (!(osize & (osize - 1)))

just a nice tidbit from Hacker's Delight
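
Spelled out as a helper (the function name is mine, not from the patch; note that the expression alone also accepts 0):

```c
/* Hacker's Delight: a power of two has exactly one bit set, so
   clearing the lowest set bit (x & (x - 1)) must leave zero. */
static int is_pow2(unsigned int x)
{
    return x != 0 && (x & (x - 1)) == 0;
}
```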


This is not to be included directly (accessed through, e.g. simd-maskedsve1024.h), so should it be named something else, e.g. simd-maskedsve.h.template?

rdolbeau and others added 13 commits July 3, 2024 07:07
and add simd-support/{generate_vtw.sh,vtw.h} into the dist tarball
ignore all files automatically generated for SVE support
ADD/SUB/MUL are three-address in SVE, but the masked form is only
two-address. And there's a lot of reuse in FFTW3 (and complex
arithmetic). But ACLE/SVE (i.e. intrinsics) doesn't have the non-masked
form :-(
So use inline ASM to force the non-masked version to be used.
Masked-out lanes should be mostly zero, and are never stored anyway, so
computing on them should be fine.

This one will be reverted if it's not a performance win.
…mpiler/hardware dependent, and more tests are needed before settling on some defaults.
When configured with --enable-sve, try to build a sample SVE program
and abort on failure; otherwise configure succeeds but make will fail.
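
Such a configure-time probe might look roughly like this in configure.ac (a sketch; the actual change is the cherry-picked ggouaillardet/fftw3@c32583c):

```
AC_MSG_CHECKING([whether the compiler can build SVE code])
AC_COMPILE_IFELSE(
  [AC_LANG_PROGRAM([[#include <arm_sve.h>]],
                   [[svfloat64_t x = svdup_f64(0.0); (void)x;]])],
  [AC_MSG_RESULT([yes])],
  [AC_MSG_RESULT([no])
   AC_MSG_ERROR([--enable-sve requires SVE; add e.g. -march=armv8-a+sve to CFLAGS])])
```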
@rdolbeau (Contributor Author) commented Jul 3, 2024

I've created a new package with pre-generated files, so maintainer mode is not required to test this on Arm+SVE, see https://github.com/rdolbeau/fftw3/releases/tag/sve-test-release-003

@rdolbeau (Contributor Author) commented Jul 3, 2024

@stevengj @matteo-frigo any comment on this? Is this OK to think about merging?

@boegel commented Jan 17, 2025

It's been a while, so time to try and get some attention to this again...

Can this be merged, and another FFTW version be released that includes these changes?

@rdolbeau changed the title from "Preliminary PR to merge ARM SVE branch" to "PR to merge ARM SVE branch" on Jan 24, 2025
@rdolbeau (Contributor Author)

@stevengj @matteo-frigo any comment on this? Is it OK to merge the branch to gain more exposure and feedback?

At this stage there are lots of positive comments, and the code has been extensively tested on all available SVE widths: 128 bits (NVIDIA Grace / Neoverse V2), 256 bits (AWS Graviton 3 / Neoverse V1), and 512 bits (Fujitsu A64FX).

@tetsuzo-usui

The performance improvements are shown here for sizes from 512 to 36864 that can be decomposed into factors of 2, 3, 5, and 7. The FFTW_MEASURE flag is used to generate the plans, and MFLOPS is defined as (5N log2 N)/t, where t is the time and N is the size of the transform.
I previously made another fork with 512-bit-only SVE for the Fugaku supercomputer, but it would be preferable to merge this PR into the upstream repository, since it supports all available SVE widths. One thing worth noting is that the same library is used for the measurements on these different 128-, 256-, and 512-bit SVE machines.
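
In code, the reported metric is simply (assuming t is in seconds):

```c
#include <math.h>

/* MFLOPS as defined above: (5 N log2 N) / t. */
double fft_mflops(double n, double t)
{
    return 5.0 * n * log2(n) / (t * 1e6);
}
```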

[chart: 1-D FFT performance with NEON and SVE on Arm]

Best regards.

@matteo-frigo merged commit 33dded1 into FFTW:master on Feb 1, 2025
@matteo-frigo (Member)

I merged this pull request but I'd like to change a few things.

  1. What's the deal with VTW_SIZE = 256? It looks like only sizes up to 32 or perhaps 64 are used. Can I ditch the larger sizes?

  2. I don't like vtw.h; I'll think of some macrology to replace it. Something like `#define REPEAT32(...) REPEAT16(...); REPEAT16(...)`. Any objections?

  3. What's the simplest way for me to try the code? I have a Pixel 8a phone that reports this in /proc/cpuinfo:

Features	: fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bti

Is it good enough?
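
For reference, a quick runtime check of what the kernel reports, including the vector length (a Linux-specific sketch, not from the PR):

```c
#include <stdio.h>
#include <sys/auxv.h>
#include <sys/prctl.h>
#include <asm/hwcap.h>    /* HWCAP_SVE */

int main(void)
{
    if (getauxval(AT_HWCAP) & HWCAP_SVE) {
        /* PR_SVE_GET_VL encodes the vector length in bytes in its low bits. */
        int vl = prctl(PR_SVE_GET_VL) & PR_SVE_VL_LEN_MASK;
        printf("SVE available, vector length %d bits\n", vl * 8);
    } else {
        printf("no SVE reported by the kernel\n");
    }
    return 0;
}
```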

@matteo-frigo (Member)

@rdolbeau @tetsuzo-usui I have pushed some cleanups onto master, can you check that I haven't broken anything?

@rdolbeau (Contributor Author) commented Feb 2, 2025

I merged this pull request but I'd like to change a few things.
1. What's the deal with VTW_SIZE = 256? It looks like only sizes up to 32 or perhaps 64 are used. Can I ditch the larger sizes?

This was expanded to support RISC-V V in addition to Arm SVE. They both use the same "one set of codelets per size + masking" idea, up to the maximum architectural size.

  • SVE is architecturally limited to 2048 bits, which is what is supported by the current version. 512 bits (A64FX), 256 bits (Neoverse V1), and 128 bits (Neoverse V2, Neoverse N2, probably most future implementations) are the ones available and tested. 1024 and 2048 bits had limited testing in the Arm Instruction Emulator & QEMU (but in non-timing-accurate emulation, anything other than FFTW_ESTIMATE planning is hard to fully test) [non-power-of-2 sizes were dropped from the specification].

  • RISC-V has no set limit, though some instructions may have fuzzy semantics beyond a certain point. For V 1.0 (the standardized version; there was some V 0.7 hardware) I have silicon with 256-bit SIMD (SpacemiT K1 in the Banana Pi F3), but there's experimental hardware with larger sizes. The 256 comes from the vector unit developed by some academic partners of the European Processor Initiative (EPI) project, which is 16384 bits, i.e. 256 double-precision scalars or 256 single-precision complex values...

2. I don't like vtw.h, I'll think of some macrology to replace it.  Something like `#define REPEAT32(...) REPEAT16(...); REPEAT16(...)`   Any objections?

No, it's just auto-generated from a script because that was easy to do. Macros are fine if they produce the same arrays.
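
For illustration, a doubling-macro scheme along those lines might look like this (a sketch of the idea, not the final code):

```c
/* Only log2(MAX) definitions are needed instead of a generated vtw.h;
   each level expands to two copies of the previous one. */
#define REPEAT1(x)  x
#define REPEAT2(x)  REPEAT1(x), REPEAT1(x)
#define REPEAT4(x)  REPEAT2(x), REPEAT2(x)
#define REPEAT8(x)  REPEAT4(x), REPEAT4(x)
#define REPEAT16(x) REPEAT8(x), REPEAT8(x)
#define REPEAT32(x) REPEAT16(x), REPEAT16(x)

/* e.g. a 32-entry array filled with the same initializer: */
static const int vtw32[32] = { REPEAT32(0) };
```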

3. What's the simplest way for me to try the code?  I have a pixel8a phone that reports this in /proc/cpuinfo:

Features : fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm jscvt fcma lrcpc dcpop sha3 sm3 sm4 asimddp sha512 sve asimdfhm dit uscat ilrcpc flagm ssbs sb paca pacg dcpodp sve2 sveaes svepmull svebitperm svesha3 svesm4 flagm2 frint svei8mm svebf16 i8mm bti

There's SVE (presumably 128 bits), so if you can compile/test FFTW3 there, it should be OK hardware-wise (only regular SVE is needed, which is all that the A64FX & Neoverse V1 support; other CPUs also have SVE2, but that doesn't bring much to the table for numerical codes). You probably want a recent version of GCC or LLVM to check the performance. There's a known "issue" with the way the code is written: as the same source is used for all sizes and relies on masking, only masked instructions are used. But in SVE, some masked instructions (including fadd and fsub!) are two-operand destructive instead of three-operand. :-( Recent compilers can optimize the predicate away and use non-masked instructions when it is safe to do so, which is more efficient.
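
Schematically, the inline-ASM trick in the branch looks like this (a simplified sketch, not the exact code in the PR):

```c
#include <arm_sve.h>

/* ACLE only exposes predicated FADD. This forces the three-operand,
   non-destructive, unpredicated encoding regardless of the compiler. */
static inline svfloat64_t fadd_unmasked(svfloat64_t a, svfloat64_t b)
{
    svfloat64_t r;
    asm ("fadd %0.d, %1.d, %2.d" : "=w" (r) : "w" (a), "w" (b));
    return r;
}
```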

Other than that, server-class, you can test in the cloud(s): AWS Graviton 3 (Neoverse V1), AWS Graviton 4 (Neoverse V2), and Microsoft Cobalt 100 (Neoverse N2) should all be good. NVIDIA Grace (Neoverse V2) hardware can be bought for on-premise use and is also good. Ampere Altra (Neoverse N1) is NOT: no SVE there. Anything else with FEAT_SVE should be good as well, but as far as I know no Apple hardware supports SVE.

Also, of interest to anyone looking at NEON vs. SVE: this hardware has FEAT_FCMA, along with all other hardware I know of with FEAT_SVE. That means it can use the complex FP instructions in NEON (fcmla, fcadd), but the current NEON code doesn't use them, as FEAT_FCMA is a recent addition to NEON. I did a quick test on a Neoverse V2, and adding FEAT_FCMA support in NEON does help the NEON code. However, that would require two sets of NEON codelets to have a "universal" library, which may not be desirable. As SVE offers access to those instructions anyway, I didn't make a clean version of the patch, as I don't think it's needed, but if someone has a different opinion I can dig it up.
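
For the curious, the fcmla idiom referred to above, in its SVE ACLE form (a sketch): two rotations together give a full complex multiply on interleaved (re, im) data.

```c
#include <arm_sve.h>

/* acc = a * b for interleaved complex doubles: rotation 0 accumulates
   re(a) * b, rotation 90 accumulates im(a) * (i * b). */
static inline svfloat64_t cmul(svbool_t pg, svfloat64_t a, svfloat64_t b)
{
    svfloat64_t acc = svdup_f64(0.0);
    acc = svcmla_f64_x(pg, acc, a, b, 0);
    return svcmla_f64_x(pg, acc, a, b, 90);
}
```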

@matteo-frigo (Member)

Thanks.

I have now mostly unified the SVE and AVX* macros for the twiddle factor storage, so things should be simpler next time.

The code works on my phone (128-bit SVE).

I'll try the cloud next.

@matteo-frigo (Member)

@rdolbeau I started an m8g.large AWS instance. It has SVE but only 128-bit. Is that expected? Should I read your message as implying that Graviton 3 has 256-bit SVE and Graviton 4 has 128-bit SVE?

@rdolbeau (Contributor Author) commented Feb 2, 2025

@rdolbeau I started an m8g.large AWS instance. It has SVE but only 128-bit. Is that expected?
Should I read your message as implying that Graviton 3 has 256-bit SVE and Graviton 4 has 128-bit SVE?

Yes and yes. m8g is Graviton 4 (https://aws.amazon.com/ec2/instance-types/m8g/), using the Neoverse V2 core, which has 128-bit SVE (and four pipelines of it).

Graviton 3 uses the Neoverse V1 core, which has 256-bit SVE (two pipelines, which are the same as the four NEON pipelines but aggregated in pairs). Those would be e.g. c7g or m7g.

For both systems, the theoretical flops throughput is the same in NEON and SVE (16 flops/cycle incl. FMA), and only twice the scalar throughput (8 flops/cycle incl. FMA). A64FX is the one where SVE gives the most theoretical benefit (only two pipelines, like on Intel systems, but 512 bits max, so 32 flops/cycle incl. FMA in SVE, 8 flops/cycle incl. FMA in NEON, and 4 flops/cycle incl. FMA in scalar). fcmla/fcadd are a significant help to performance in SVE vs. the current NEON code, as mentioned above.
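
An easy way to confirm what width a given machine actually provides (compile with e.g. -march=armv8-a+sve):

```c
#include <stdio.h>
#include <arm_sve.h>

int main(void)
{
    /* svcntb() returns the runtime SVE vector length in bytes. */
    printf("SVE vector length: %lu bits\n", (unsigned long)(svcntb() * 8));
    return 0;
}
```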

@matteo-frigo (Member)

@rdolbeau @tetsuzo-usui What's the right way to handle the cycle counter these days on arm64?

It looks like one has to configure with --enable-armv8-cntvct-el0. Is this supported on all armv8 so that we can enable it automatically? Is there some intrinsic nowadays that does the job?

@rdolbeau (Contributor Author) commented Feb 3, 2025

@rdolbeau @tetsuzo-usui What's the right way to handle the cycle counter these days on arm64?

It looks like one has to configure with --enable-armv8-cntvct-el0. Is this supported on all armv8 so that we can enable it automatically? Is there some intrinsic nowadays that does the job?

I think it is architecturally defined and therefore should be OK on all Arm v8 (and v9, which is just a renaming of v8), so it could be set as the default on v8. --enable-armv8-pmccntr-el0 is much more specific and requires kernel enablement.

I don't know of an intrinsic to access the counter directly, but it might exist in some compilers? IIRC it's just a line or two of ASM in the current code.
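
For reference, it is indeed a one-instruction read, plus one more for the counter frequency (a sketch of the usual pattern):

```c
#include <stdint.h>

static inline uint64_t read_cntvct(void)
{
    uint64_t t;
    /* CNTVCT_EL0: architected virtual counter, readable from EL0 on
       Linux by default (unlike PMCCNTR_EL0, which needs kernel help). */
    asm volatile ("mrs %0, cntvct_el0" : "=r" (t));
    return t;
}

static inline uint64_t read_cntfrq(void)
{
    uint64_t f;
    asm volatile ("mrs %0, cntfrq_el0" : "=r" (f));  /* ticks per second */
    return f;
}
```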

@tetsuzo-usui

@rdolbeau @tetsuzo-usui I have pushed some cleanups onto master, can you check that I haven't broken anything?

The latest FFTW in the current master branch has been tested for numerical results and performance on Grace, Graviton 3 and A64FX, and all are fine.

About 1,000 randomly selected tests were run with "make check", and all passed in double and single precision on all machines. An excerpt of the output log is shown below:

perl -w ./check.pl  -r -c=30 -v `pwd`/bench
--------------------------------------------------------------
         FFTW transforms passed basic tests!
--------------------------------------------------------------
perl -w ./check.pl  -r -c=30 -v --nthreads=2 `pwd`/bench
perl -w ./check.pl  -r -c=5 -v --threads_callback --nthreads=2 `pwd`/bench
--------------------------------------------------------------
         FFTW threaded transforms passed basic tests!
--------------------------------------------------------------

The difference in average performance for sizes from 512 to 36864, compared to Dolbeau's PR, is about 1% on each platform, which is not significant beyond measurement error.
